Association Rules Mining Based Clinical Observations
Healthcare institutions accumulate ever-growing repositories of patients'
disease-related information, which would be more useful if subjected to
relational analysis. Data mining algorithms have proven effective at
uncovering correlations in large data repositories. In this paper we
implement a novel idea, based on association rule mining, for finding
co-occurrences of diseases carried by a patient using the healthcare
repository. We have developed a system prototype for Clinical State Correlation
Prediction (CSCP) which extracts data from a patients' healthcare database and
transforms the OLTP data into a Data Warehouse by generating association rules.
The CSCP system helps reveal relations among diseases: it predicts the
correlation(s) between the primary disease (the disease for which the
patient visits the doctor) and the secondary disease(s) (other associated
diseases carried by the same patient).
Comment: 5 pages, MEDINFO 2010, C. Safran et al. (Eds.), IOS Pres
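As a rough illustration of the kind of rule the abstract describes, the sketch below derives pairwise "disease A implies disease B" rules with support and confidence from a list of per-patient diagnoses. It is not the CSCP implementation: the function name `pairwise_rules`, the thresholds, and the toy records are all assumptions made for this example, and only rules between two diseases are considered.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(records, min_support=0.3, min_confidence=0.6):
    """Return sorted (antecedent, consequent, support, confidence) tuples.

    Hypothetical helper: mines only two-item rules, for illustration.
    """
    n = len(records)
    item_counts = Counter()
    pair_counts = Counter()
    for diseases in records:
        unique = sorted(set(diseases))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of patients carrying both diseases
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # confidence = P(rhs present | lhs present)
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical toy records: one list of diagnoses per patient.
records = [
    ["diabetes", "hypertension"],
    ["diabetes", "hypertension", "neuropathy"],
    ["hypertension"],
    ["diabetes", "neuropathy"],
]
rules = pairwise_rules(records)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

In this reading, the antecedent plays the role of the primary disease and the consequent the secondary disease; a production system would mine larger itemsets as well.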
DisPredict: A Predictor of Disordered Protein Using Optimized RBF Kernel
Intrinsically disordered proteins or regions perform important biological functions through their dynamic conformations during binding. Accurate identification of these disordered regions therefore has significant implications for proper annotation of function, induced fold prediction and drug design to combat critical diseases. We introduce DisPredict, a disorder predictor that employs a single support vector machine with an RBF kernel and novel features for reliable characterization of protein structure. DisPredict yields effective performance. In addition to 10-fold cross-validation, training and testing of DisPredict were conducted with independent test datasets. The results were consistent, with both the training and test errors minimal. The use of multiple data sources makes the predictor generic. The datasets used in developing the model include disordered regions of various lengths, categorized as short and long, with different compositions and different types of disorder, ranging from fully to partially disordered regions as well as completely ordered regions. Through comparison with other state-of-the-art approaches and case studies, DisPredict is found to be a useful tool with competitive performance. DisPredict is available at https://github.com/tamjidul/DisPredict_v1.0
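For readers unfamiliar with the RBF kernel mentioned above, the following minimal sketch computes the standard form K(x, y) = exp(-gamma * ||x - y||^2) for two feature vectors. The feature values and the `gamma` setting are placeholders chosen for this example; the abstract does not state DisPredict's actual features or optimized kernel parameters.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Standard RBF (Gaussian) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical per-residue feature vectors (values are illustrative only).
residue_a = [0.0, 0.0]
residue_b = [1.0, 1.0]

same = rbf_kernel(residue_a, residue_a)      # identical vectors give 1.0
different = rbf_kernel(residue_a, residue_b)  # decays with squared distance
print(same, different)
```

A larger `gamma` makes the similarity fall off faster with distance, which is why the parameter is usually tuned together with the SVM's regularization constant.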
PCaAnalyser: A 2D-Image Analysis Based Module for Effective Determination of Prostate Cancer Progression in 3D Culture
Three-dimensional (3D) in vitro cell-based assays for Prostate Cancer (PCa) research are rapidly becoming the preferred alternative to conventional 2D monolayer cultures. 3D assays more precisely mimic the microenvironment found in vivo, and thus are ideally suited to evaluating compounds and their suitability for progression in the drug discovery pipeline. To achieve the high throughput needed for most screening programs, automated quantification of 3D cultures is required. Towards this end, this paper reports on the development of a prototype analysis module for an automated high-content-analysis (HCA) system, which allows for accurate and fast investigation of in vitro 3D cell culture models for PCa. The Java-based program, which we have named PCaAnalyser, uses novel algorithms that allow accurate and rapid quantitation of protein expression in 3D cell culture. As currently configured, PCaAnalyser can quantify a range of biological parameters, including nuclei count, nuclei-spheroid membership prediction, and various function-based classifications of peripheral and non-peripheral areas to measure expression of biomarkers and protein constituents known to be associated with PCa progression; it can also segregate cellular objects effectively across a range of signal-to-noise ratios. In addition, the PCaAnalyser architecture is highly flexible, operating as a single independent analysis as well as in batch mode, which is essential for High-Throughput Screening (HTS). PCaAnalyser provides accurate and rapid analysis in an automated high-throughput manner, and reproducible analysis of the distribution and intensity of well-established markers associated with PCa progression in a range of metastatic PCa cell lines (DU145 and PC3) in a 3D model is demonstrated
Critical assessment of protein intrinsic disorder prediction
Abstract: Intrinsically disordered proteins, defying the traditional protein structure–function paradigm, are a challenge to study experimentally. Because a large part of our knowledge rests on computational predictions, it is crucial that their accuracy is high. The Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment was established as a community-based blind test to determine the state of the art in prediction of intrinsically disordered regions and the subset of residues involved in binding. A total of 43 methods were evaluated on a dataset of 646 proteins from DisProt. The best methods use deep learning techniques and notably outperform physicochemical methods. The top disorder predictor has Fmax = 0.483 on the full dataset and Fmax = 0.792 following filtering out of bona fide structured regions. Disordered binding regions remain hard to predict, with Fmax = 0.231. Interestingly, computing times among methods can vary by up to four orders of magnitude
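The Fmax figures quoted above are the maximum F1 score obtained over all decision thresholds. The sketch below shows that computation on a handful of per-residue scores; the function name `f_max`, the toy scores and the choice of candidate thresholds are assumptions for illustration, and the exact CAID evaluation protocol is not reproduced here.

```python
def f_max(scores, labels, thresholds=None):
    """Return (best F1, threshold) over candidate thresholds.

    A residue is predicted disordered when its score >= threshold.
    """
    if thresholds is None:
        thresholds = sorted(set(scores))
    best_f, best_t = 0.0, None
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f:
            best_f, best_t = f1, t
    return best_f, best_t


# Hypothetical predictor scores and ground truth (1 = disordered residue).
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
best_f, best_t = f_max(scores, labels)
print(f"Fmax = {best_f:.3f} at threshold {best_t}")
```

Because the threshold is chosen after the fact, Fmax describes the best achievable operating point of a predictor rather than its performance at a fixed cut-off.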
Genetic algorithm for Ab initio protein structure prediction based on low resolution models
A protein is a sequence of amino acids bonded into a linear chain that adopts a specific folded three-dimensional (3D) shape. This specific folded shape enables the protein to perform specific tasks. Amongst the various available computational methods, protein structure prediction by the ab initio approach is promising and can help to unravel the relationship between a sequence and its associated structure. This thesis is focused on ab initio protein structure prediction (PSP), developing a novel Genetic Algorithm (GA) for an efficient and effective conformation search of low resolution models derived from the two-bead hydrophobic-hydrophilic (HP) models. The thesis also proposes a novel low resolution model, called the hHPNX model, which provides more accurate predictions than the existing low resolution HP models. As a search technique, GA shows promise in the complex search landscape for investigating the PSP problem. However, for longer sequences the performance of GA can deteriorate and cause the algorithm to frequently stall or become stuck in local minima. Therefore, in this thesis, a critical analysis of the working principle of GA (i.e., the schemata theorem) is presented. This analysis leads to the generalisation of the schemata theorem. The fallacies in the selection procedure of the schemata theorem are removed and its crossover operation is fully defined. A novel concept, a chromosome correlation factor (CCF), is proposed to identify similar chromosomes within the GA population, and the optimal value of CCF enables GA to perform effectively and thus helps provide superior results. Further, a non-isomorphic encoding algorithm is proposed for a bijective encoding within GA that prevents the expansion of the search landscape by maintaining a 1:1 relationship between the genotype and the phenotype. The non-isomorphic encoding reduces the chances of GA stalling and also prevents the tendency of the normal stochastic GA search to behave like a random search.
Since the PSP solutions are compact in nature, the simple GA developed without any heuristics is further improved as a hybrid GA (HGA) by utilising domain-specific knowledge. For an optimal core cavity, we have defined likely sub-conformations to provide guided search. Further, the multi-objective formulation of the search problem can overcome possible stall or stuck conditions by backtracking effectively and performing efficiently. Novel and effective move operators are designed and applied to efficiently move part of the converging compact conformation and thus achieve overall superior results. The simplified HP model and its extension, the HPNX model, are effective in exploring the convoluted PSP search landscape quickly. With its simplicity maintained, the HPNX model is extended to a novel model called the hHPNX model, which reduces the amount of degeneracy and additionally captures the characteristics of two distinguished amino acids (Alanine and Valine) from the hydrophobic group. A corrected interaction potential matrix for an existing YhHX model is proposed, leading to its correct representation. Further, the face-centred-cube (FCC) model is shown to have the optimal lattice configuration for closely mapping the real folded protein. Three novel techniques are developed to compute the fitness function efficiently and so reduce the computation time. Most importantly, the improvement in the speed of computation is achieved without sacrificing the accuracy of the prediction. All the techniques are complementary to each other and can work concurrently, thereby reducing the computation time significantly
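To make the fitness function of the basic HP model concrete, the sketch below scores a conformation by counting hydrophobic (H) contacts between residues that are lattice neighbours but not adjacent in the chain. It is a simplified illustration, not the thesis code: it uses a 2D square lattice rather than the FCC lattice, and the move alphabet (`U`, `D`, `L`, `R`) and function names are assumptions made for this example.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def decode(moves):
    """Map a move string to lattice coordinates; None if the walk self-intersects."""
    pos = (0, 0)
    coords = [pos]
    seen = {pos}
    for m in moves:
        dx, dy = MOVES[m]
        pos = (pos[0] + dx, pos[1] + dy)
        if pos in seen:
            return None  # two residues cannot occupy the same lattice site
        seen.add(pos)
        coords.append(pos)
    return coords


def hp_energy(sequence, moves):
    """Energy = -(number of non-consecutive H-H lattice contacts); None if invalid."""
    coords = decode(moves)
    if coords is None or len(coords) != len(sequence):
        return None
    index = {c: i for i, c in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                contacts += 1
    return -contacts


folded = hp_energy("HPPH", "RUL")    # chain bends back, ends touch
extended = hp_energy("HPPH", "RRR")  # straight chain, no contacts
print(folded, extended)
```

A GA over this representation would treat the move string as the chromosome and minimise `hp_energy`; rejecting self-intersecting walks in `decode` is one simple way to keep every chromosome mapped to a valid conformation.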
Estimation of Position Specific Energy as a Feature of Protein Residues from Sequence Alone for Structural Classification.
A set of features computed from the primary amino acid sequence of proteins is crucial in the process of inducing a machine learning model that is capable of accurately predicting three-dimensional protein structures. Solutions for existing protein structure prediction problems need features that can capture the complexity of molecular-level interactions. With this in view, we propose a novel approach to estimate the position specific estimated energy (PSEE) of a residue using contact energy and predicted relative solvent accessibility (RSA). Furthermore, we demonstrate that PSEE can be reasonably estimated based on sequence information alone. PSEE is useful in identifying the structured as well as the unstructured, or intrinsically disordered, regions of a protein by computing favorable and unfavorable energy respectively, characterized by an appropriate threshold. The most intriguing finding, verified empirically, is the indication that the PSEE feature can effectively classify disordered versus ordered residues and can segregate residues of different secondary structure types by computing the constituent energies. PSEE values for each amino acid strongly correlate with the hydrophobicity value of the corresponding amino acid. Further, PSEE can be used to detect the existence of critical binding regions that undergo disorder-to-order transitions to perform crucial biological functions. As an application of disorder prediction using the PSEE feature, we have rigorously tested and found that a support vector machine model informed by a set of features including PSEE consistently outperforms a model with an identical set of features with PSEE removed. In addition, the new disorder predictor, DisPredict2, shows competitive performance in predicting protein disorder when compared with six existing disordered protein predictors
Performance of ordered and disordered residue classification based on per residue PSEE value calculated using different contact radius (CR) values.
Classification performance is shown in terms of (A) ACC (blue bar), (B) PPV (purple bar) and (C) MCC (green bar) for CR values from 4 to 30. The x-axis and y-axis show the CR values and the performance metric values, respectively.
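The three metrics named in the caption can be computed from a binary confusion matrix as sketched below, using the standard definitions of accuracy (ACC), positive predictive value (PPV) and the Matthews correlation coefficient (MCC). The counts are made-up values for illustration and are not taken from the figure.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return (ACC, PPV, MCC) for a binary confusion matrix."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    ppv = tp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, ppv, mcc


# Hypothetical counts for one contact-radius setting.
acc, ppv, mcc = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"ACC={acc:.3f} PPV={ppv:.3f} MCC={mcc:.3f}")
```

MCC is often preferred for ordered versus disordered classification because it stays informative when the two classes are imbalanced, whereas accuracy can look high for a predictor that favours the majority class.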